Using Audio Books for Training a Text-to-Speech System
Abstract
Creating a new voice for a TTS system often requires designing and recording an audio corpus, a costly, time-consuming, and effort-intensive task. Using publicly available audiobooks as the raw material for such a spoken corpus opens up the possibility of creating new synthetic voices quickly and with limited effort. This paper addresses the automated creation of new synthetic voices from audiobook data. Because an audiobook includes several types of speech, such as narration and character playing, special care is given to identifying the data subset that leads to a more neutral, general-purpose synthetic voice. Part of the work described in this paper was carried out for the participation of our TTS system in the Blizzard Challenge 2013, where the developed TTS system was ranked top in overall impression in one of the two experimental hubs (Chalamandaris et al., 2013). The main goal is to identify and address the effect of audiobook speech diversity on the resulting TTS system. Along with the methodology for coping with this diversity in the speech data, we also describe a set of experiments performed to investigate the efficiency of different approaches to automatic data pruning. Plans for further exploiting the diversity of the speech in an audiobook are described in the final section, where conclusions are also drawn.
Similar resources
Cipher text only attack on speech time scrambling systems using correction of audio spectrogram
Recently permutation multimedia ciphers were broken in a chosen-plaintext scenario. That attack models a very resourceful adversary which may not always be the case. To show insecurity of these ciphers, we present a cipher-text only attack on speech permutation ciphers. We show inherent redundancies of speech can pave the path for a successful cipher-text only attack. To that end, regularities ...
L2 Learners’ Lexical Inferencing: Perceptual Learning Style Preferences, Strategy Use, Density of Text, and Parts of Speech as Possible Predictors
This study was intended first to categorize the L2 learners in terms of their learning style preferences and second to investigate if their learning preferences are related to lexical inferencing. Moreover, strategies used for lexical inferencing and text related issues of text density and parts of speech were studied to determine their moderating effects and the best predictors of lexical infe...
The Speect text-to-speech system entry for the Blizzard Challenge 2013
This paper describes the Speect text-to-speech system entry for the Blizzard Challenge 2013. The techniques applied for the tasks of the challenge are described as well as the implementation details for the alignment of the audio books and the text-to-speech system modules. The results of the evaluations are given and discussed.
Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012
This paper describes a hidden Markov model (HMM) based speech synthesis system developed for the Blizzard Challenge 2012. In the Blizzard Challenge 2012, we focused on a design of contexts for using audio books as training data and duration modeling of silence between sentences for synthesizing paragraphs. It is well known that contextual factors affect speech. We use extended contexts for usin...
Automatic Building of Synthetic Voices from Audio Books
Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves development of prosodically rich speech databases. However, development of such speech databases requires a large amount of effort and time. An alternative is to exploit story style monologues (long speech files) in audio books. These...
Real Time Implementation of Image Recognition and Text to Speech Conversion
This paper introduces an innovative, efficient and realtime cost beneficial technique that enables user to hear the contents of documents/text images instead of reading through them. It combines the concept of Optical Character Recognition (OCR) and Text to Speech Synthesiser (TTS) in MATLAB R2011b. This kind of system enables visually impaired people to interact with computers effectively thro...
Publication date: 2014